Add new index to postgres datastore for GC #1550

vroldanbet · 2023-09-28T09:03:47Z

This adds 1 new indexes to optimize:

GC operations

⚠️ important! This index will not be selected by the query planner in PG 13 and 14 due to the usage of the xid8 type

add tuned GC index

the original GC index was rarely selected
when GC'ing relations with millions of relationships.
This caused a sequential scan that could load potentially
Gigabytes worth of data, and overload the database.

⚠️ Postgres 13 and 14 do not correctly select this new index
because the query planner struggles with the xid8 datatype.
Postgres 15 will be needed for this index to
actually work. For scenarios where relations aren't very wide,
older PG versions may have an acceptable query latency for GC,
but we strongly recommend moving to PG15.

This migration also deletes the old suboptimal index after
the new one is created.

josephschorr · 2023-09-29T17:49:51Z

@vroldanbet I wonder if we should have a migration to run ANALYZE TABLE after the concurrent indexes are added to ensure the stats are correct for them?

vroldanbet · 2023-10-02T08:31:48Z

@josephschorr from this post:

A table of 40 Million rows, the auto vacuum will run when the table will receive 8 Million updates or deletes. Similarly the table needs to receive 4 Million updates or deletes in order for the auto analyze to start. Mostly tables of such size would become slow before this threshold is received, due to which manually VACUUM FULL ANALYZE is recommend on a regular basis.

This other post suggests that analyze would be necessary if index has an expression to it, which is the case here:

If the index is just on simple columns like in your case, it is not necessary to ANALYZE the table after you create the index.
That is because statistics on the value distribution of the table columns are always collected, no matter if there is an index on the column or not.

However, if you are indexing an expression like upper(some_column) or (CAST(some_column AS date)), you should run ANALYZE after creating the index.
PostgreSQL will then also collect statistics on the value distribution of the indexed expression. This happens automatically whenever autoanalyze runs, but it is a good idea to do it manually right after creating the index so you have good statistics right away.

My concerns here are:

is running ANALYZE effective if the index is being created CONCURRENTLY?
Could running ANALYZE to a PG running already a CREATE INDEX statement add sufficient load to disrupt service?

josephschorr · 2023-10-02T19:00:04Z

is running ANALYZE effective if the index is being created CONCURRENTLY?

Unlikely, but I believe the migration will wait for the index to complete (concurrently) before the next one is run, so the ANALYZE should apply afterwards

Could running ANALYZE to a PG running already a CREATE INDEX statement add sufficient load to disrupt service?

Yes, but one would imagine that CREATE INDEX is not also running unless someone is running the migration command twice (at the same time)

vroldanbet · 2023-10-03T09:22:19Z

@josephschorr

This blog post suggests running analyze should only be done when bulk loading, and that doing it in other situations is adding unnecessary pressure to the database:

Autovacuum also keeps a table’s data distribution statistics up-to-date (it doesn’t rebuild them). When manually run, the ANALYZE command actually rebuilds these statistics instead of updating them. Again, rebuilding statistics when they’re already optimally updated by a regular autovacuum might cause unnecessary pressure on system resources.

The time when you must run ANALYZE manually is immediately after bulk loading data into the target table. A large number (even a few hundred) of new rows in an existing table will significantly skew its column data distribution. The new rows will cause any existing column statistics to be out-of-date. When the query optimizer uses such statistics, query performance can be really slow. In these cases, running the ANALYZE command immediately after a data load to completely rebuild the statistics is a better option than waiting for the autovacuum to kick in.

Another post indicates that doing it after adding an index is actually a good idea:

You may want to run ANALYZE in these situations:

when the contents of a table has changed significantly. For example, when a few percents of records in a table have been added, updated or deleted.

just before or after adding an index to a table. That may help query planner to generate optimal query plans that will efficiently the new index.

I also tried to find evidences of whether ANALYZE was blocking. From the official docs:

Specifies that ANALYZE should not wait for any conflicting locks to be released when beginning work on a relation: if a relation cannot be locked immediately without waiting, the relation is skipped. Note that even with this option, ANALYZE may still block when opening the relation's indexes or when acquiring sample rows from partitions, table inheritance children, and some types of foreign tables. Also, while ANALYZE ordinarily processes all partitions of specified partitioned tables, this option will cause ANALYZE to skip all partitions if there is a conflicting lock on the partitioned table

This post suggests it does not block

While ANALYZE runs, other queries may access the table because the ANALYZE command does not block the table.

This other blog post also supports this idea:

Note that, in PostgreSQL, ANALYZE command doesn't read or update indexes. It deals only with table/column contents. It also doesn't block the table - other queries may read from the table while ANALYZE is running.

And FWIW, anecdotal evidence on a cluster with millions of rows show the index being picked immediately after run with CREATE INDEX CONCURRENTLY 🤷🏻

the original GC index was rarely selected when GC'ing relations with millions of relationships. This caused a sequential scan that could load potentially Gigabytes worth of data, and overload the database. ⚠️ Postgres 13 and 14 do not correctly select this new index because the query planner struggles with the xid8 datatype. Postgres 15 will be needed for this index to actually work. For scenarios where relations aren't very wide, older PG versions may have an acceptable query latency for GC, but we strongly recommend moving to PG15. This migration also deletes the old suboptimal index after the new one is created.

josephschorr

LGTM

vroldanbet requested a review from a team September 28, 2023 09:03

github-actions bot added the area/datastore Affects the storage system label Sep 28, 2023

vroldanbet self-assigned this Sep 28, 2023

jzelinskie added the area/perf Affects performance or scalability label Sep 28, 2023

vroldanbet mentioned this pull request Sep 29, 2023

Improve experience on pg serialization errors #1552

Merged

vroldanbet force-pushed the add-new-indexes branch from 9fe5986 to 84250fd Compare September 29, 2023 10:55

vroldanbet force-pushed the add-new-indexes branch from 84250fd to b3b5531 Compare October 2, 2023 12:51

vroldanbet enabled auto-merge October 2, 2023 12:51

vroldanbet force-pushed the add-new-indexes branch from b3b5531 to 13f6d3f Compare October 2, 2023 18:57

josephschorr changed the title ~~Add new indexes to postgres datastore~~ Add new index to postgres datastore Oct 2, 2023

vroldanbet force-pushed the add-new-indexes branch from 13f6d3f to 9025669 Compare October 3, 2023 09:10

vroldanbet force-pushed the add-new-indexes branch from 9025669 to 1e81e68 Compare October 3, 2023 09:31

josephschorr approved these changes Oct 3, 2023

View reviewed changes

vroldanbet added this pull request to the merge queue Oct 3, 2023

Merged via the queue into main with commit 9035a64 Oct 3, 2023
25 checks passed

vroldanbet deleted the add-new-indexes branch October 3, 2023 14:59

github-actions bot locked and limited conversation to collaborators Oct 3, 2023

vroldanbet changed the title ~~Add new index to postgres datastore~~ Add new index to postgres datastore for GC Apr 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new index to postgres datastore for GC #1550

Add new index to postgres datastore for GC #1550

vroldanbet commented Sep 28, 2023 •

edited

Loading

josephschorr commented Sep 29, 2023

vroldanbet commented Oct 2, 2023

josephschorr commented Oct 2, 2023

vroldanbet commented Oct 3, 2023

josephschorr left a comment

Add new index to postgres datastore for GC #1550

Add new index to postgres datastore for GC #1550

Conversation

vroldanbet commented Sep 28, 2023 • edited Loading

add tuned GC index

josephschorr commented Sep 29, 2023

vroldanbet commented Oct 2, 2023

josephschorr commented Oct 2, 2023

vroldanbet commented Oct 3, 2023

josephschorr left a comment

Choose a reason for hiding this comment

vroldanbet commented Sep 28, 2023 •

edited

Loading